A Kernel Two-Sample Test
Authors
Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, Alexander Smola
Abstract
We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine whether two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear-time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
∗ Also at Gatsby Computational Neuroscience Unit, CSML, 17 Queen Square, London WC1N 3AR, UK.
† This work was carried out while K.M.B. was with the Ludwig-Maximilians-Universität München.
‡ This work was carried out while M.J.R. was with the Graz University of Technology.
§ Also at The Australian National University, Canberra, ACT 0200, Australia.
©2012 Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf and Alexander Smola.
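The quadratic-time computation mentioned in the abstract can be sketched with the standard unbiased estimator of MMD². The sketch below assumes a Gaussian (RBF) kernel and a fixed bandwidth `sigma`; both are illustrative choices, since the abstract leaves the kernel unspecified.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel matrix between the rows of X and Y.
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    # Unbiased quadratic-time estimate of MMD^2 between samples X and Y:
    # average within-sample kernel values (diagonal excluded) minus twice
    # the average cross-sample kernel value.
    m, n = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_y = (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
    return term_x + term_y - 2 * Kxy.mean()
```

With identically distributed samples the estimate fluctuates around zero; under a clear mean shift it is well above zero, which is what the tests in the paper exploit.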
Similar resources
Optimal kernel choice for large-scale two-sample tests
Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis that p = q, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel use...
B-test: A Non-parametric, Low Variance Kernel Two-sample Test
We propose a family of maximum mean discrepancy (MMD) kernel two-sample tests that have low sample complexity and are consistent. The test has a hyperparameter that allows one to control the tradeoff between sample complexity and computational time. Our family of tests, which we denote as B-tests, is both computationally and statistically efficient, combining favorable properties of previously ...
B-tests: Low Variance Kernel Two-Sample Tests
A family of maximum mean discrepancy (MMD) kernel two-sample tests is introduced. Members of the test family are called Block-tests or B-tests, since the test statistic is an average over MMDs computed on subsets of the samples. The choice of block size allows control over the tradeoff between test power and computation time. In this respect, the B-test family combines favorable properties of p...
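The block-averaging idea described above can be sketched directly: compute an unbiased MMD² on disjoint blocks of the samples and average the results. The kernel choice (Gaussian), the bandwidth, and the default block size below are illustrative assumptions, not values from the B-test papers.

```python
import numpy as np

def rbf(X, Y, sigma=1.0):
    # Pairwise Gaussian (RBF) kernel matrix between the rows of X and Y.
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_block(X, Y, sigma=1.0):
    # Unbiased MMD^2 estimate on one block (equal-sized samples assumed).
    m = len(X)
    Kxx, Kyy, Kxy = rbf(X, X, sigma), rbf(Y, Y, sigma), rbf(X, Y, sigma)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2 * Kxy.mean())

def b_test_statistic(X, Y, block_size=50, sigma=1.0):
    # Average the per-block MMD^2 estimates over disjoint blocks. The block
    # size controls the tradeoff the text describes: larger blocks cost more
    # computation per block, while the average over many small blocks has
    # low variance and an easily estimated standard error.
    n = min(len(X), len(Y)) // block_size * block_size
    stats = [mmd2_block(X[i:i + block_size], Y[i:i + block_size], sigma)
             for i in range(0, n, block_size)]
    return float(np.mean(stats)), float(np.std(stats, ddof=1) / np.sqrt(len(stats)))
```

Because the statistic is an average of (nearly) independent block estimates, a normal approximation to its null distribution is natural, which is one reason the B-test is computationally convenient.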
A Permutation-Based Kernel Conditional Independence Test
Determining conditional independence (CI) relationships between random variables is a challenging but important task for problems such as Bayesian network learning and causal discovery. We propose a new kernel CI test that uses a single, learned permutation to convert the CI test problem into an easier two-sample test problem. The learned permutation leaves the joint distribution unchanged if a...
Exponentially Consistent Kernel Two-Sample Tests
Given two sets of independent samples from unknown distributions P and Q, a two-sample test decides whether to reject the null hypothesis that P = Q. Recent attention has focused on kernel two-sample tests as the test statistics are easy to compute, converge fast, and have low bias with their finite sample estimates. However, an exact characterization of the asymptotic perform...
Topics in kernel hypothesis testing
This thesis investigates some unaddressed problems in kernel nonparametric hypothesis testing. The contributions are grouped around three main themes: Wild Bootstrap for Degenerate Kernel Tests. A wild bootstrap method for non-parametric hypothesis tests based on kernel distribution embeddings is proposed. This bootstrap method is used to construct provably consistent tests th...
Journal: Journal of Machine Learning Research
Volume 13
Pages: -
Published: 2012